Time-Based Coordinated Checkpointing by Nuno

Authors

  • Nuno F. Neves
  • Ravishankar Iyer
  • Jane Liu
  • Laxmikant Kale
Abstract

Distributed systems are used to support the execution of applications ranging from long-running scientific simulators to e-commerce on the Internet. In this type of environment, the failure of a single component, whether a computer or the network, may prevent other components from completing their tasks. Since the probability of failure increases with the number of computers and the execution time, these applications are likely to be interrupted unless provision is made for failure handling. This thesis addresses the problem of fault recovery in distributed systems.

The thesis describes two variations of a coordinated checkpoint protocol that uses time to remove most causes of overhead and to avoid all types of direct coordination. The time-based protocol does not have to transmit extra messages, does not need to tag the application messages, and accesses stable storage only when checkpoints are saved.

The thesis also describes a new coordinated checkpoint protocol that is well adapted to mobile environments. It uses time to indirectly coordinate the creation of new global states, and it saves two different types of checkpoints to adapt its behavior to the current network characteristics.

Traditional techniques for fault diagnosis in distributed systems, based on either watchdogs or polling, trade off performance against detection latency. The thesis introduces a complementary mechanism that uses the error codes returned by stream sockets. Since these errors are generated automatically when there is communication with a failed process, the mechanism incurs only a small overhead. Our results show that, in most cases, failures could be located using only the errors from the sockets.

A large number of checkpoint-based recovery protocols have been proposed in the literature; however, most of them were never evaluated. The thesis describes the design and implementation of a run-time system for clusters of workstations that allows the rapid testing of checkpoint protocols with standard benchmarks. RENEW (Recoverable Network of Workstations) provides a flexible set of operations that facilitates the integration of checkpoint and rollback recovery protocols.

To my father, Marinho Ferreira Neves.

… for their many contributions to this thesis. Special thanks are due to Professor William Sanders for his help on the treatment of experimental data. To Jenny Applequist and Jill Comer for carefully reading the papers and for correcting their imperfections. To Pedro Trancoso for his constant encouragement and friendship. Thanks are also due to my friends Venkataraman for their help during …
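
The abstract's idea of diagnosing failures from stream-socket error codes can be illustrated with a small sketch. This is not the mechanism implemented in the thesis: the function name `peer_seems_failed`, the set of error codes treated as failure hints, and the choice to probe by opening a new connection are all assumptions made for illustration.

```python
import errno
import socket

# Assumption: these error codes are treated as hints that the peer process
# or its host has failed. The thesis may use a different set.
FAILURE_ERRNOS = {
    errno.ECONNREFUSED,  # nothing is listening on the peer's port
    errno.ECONNRESET,    # the peer reset the connection
    errno.EPIPE,         # write on a connection the peer has closed
    errno.ETIMEDOUT,     # the peer did not answer in time
}


def peer_seems_failed(host, port, timeout=1.0):
    """Hypothetical helper: open a stream socket to (host, port) and report
    whether the resulting socket error suggests that the peer has failed."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False  # connection established, peer looks alive
    except socket.timeout:
        return True  # no answer within the timeout
    except OSError as exc:
        return exc.errno in FAILURE_ERRNOS
```

In this sketch the error arises as a side effect of ordinary socket use, so no separate watchdog or polling message is exchanged to obtain it.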

Similar Resources

An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment

Mobile computing systems are made up of different components, among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environments. In the suggested scheme, nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs, and as a result the workload of Mobile ...

Coordinated Checkpointing Without Direct Coordination

Coordinated checkpointing is a well-known method to achieve fault tolerance in distributed systems. Long-running parallel applications and high-availability applications are two potential users of checkpointing, although with different requirements. Parallel applications need low failure-free overheads, and high-availability applications require fast and bounded recoveries. In this paper, we des...

Adaptive Checkpointing with Storage Management for Mobile Environments

The limited stable storage available in mobile-computing environments can make traditional checkpointing and message logging unsuitable. Since storage on a mobile host is not considered stable, most protocols designed for these environments save the checkpoints on base stations. Previous approaches have assumed that the base station always has sufficient disk space for storing ch...

Using Time to Improve the Performance of Coordinated Checkpointing

This paper describes and evaluates a coordinated checkpoint protocol that uses time to eliminate several performance overheads that are present in traditional protocols. The time-based protocol does not have to exchange coordination messages, does not need to add information to the processes' messages, and only accesses stable storage when checkpoints are saved. This protocol uses a simple initia...

Coherence-based Coordinated Checkpointing for Software Distributed Shared Memory Systems

Fault-tolerant techniques that can cope with system failures in software distributed shared memory (SDSM) are essential for creating productive and highly available parallel computing environments on clusters of workstations. In this paper, we propose a new, efficient coordinated checkpointing technique, called coherence-based coordinated checkpointing (CCC), for SDSM. Our CCC minimizes both th...

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing, which periodically ...

Publication date: 1998